In this tutorial, we start with simple R code, and we then introduce basic statistics and probabilities. Utilizing R code and packages to learn statistics methods is the core of this webpage. (here’s the presentation)

  • R

  • Probability and Statistics

  • Data with R

  • Statistics with R

  • Model with R


WHY?

Why is it important to learn statistics, and why do we use R for data analysis? In the current era characterized by big data, extensive information, and artificial intelligence (AI), statistical models form the cornerstone of AI and machine learning. Statistics provide a comprehensive framework for explaining outcomes, parameters, and nearly all aspects of these models. By utilizing R, a powerful statistical computing language, researchers can efficiently analyze and interpret complex datasets, thereby advancing the field of data science.


1 R

J.D. Long, P. Teetor (2019) present a R cookbook by bookdown. Start with install R and its compiler R studio.


1.1 Get Start

The R studio page display 4 windows unless nav-bar, The upper left window is the opening file, I prefer utilize .Rmd (R mark down). As FIG 1., we code between the area and see the result below. You also could type the code behind the > on lower left window, and click enter to run the code. Two right windows illustrate the variable, function, and packages, we will introduce how they work later.

FIG 1
FIG 1

1.2 Variables

For the problem of using <- or =, we will see the same result for setting variables but my suggestion is <-.

x <- 100
x = 100
x
## [1] 100

1.2.1 integer

int <- 10
int
## [1] 10

1.2.2 character

chr <- "I am a character"
chr
## [1] "I am a character"

1.2.3 vector

Be careful the variable name, such as c is a function fro creating vectors.

V1 <- c(1:10)
V2 <- c("2", 2)
V1
##  [1]  1  2  3  4  5  6  7  8  9 10
V2
## [1] "2" "2"

1.2.4 function

\[ f(x)=x^2+2x+1\text{, and find } f(\pi). \]

f <- function(x){ x^2+2*x+1 }
f(pi)
## [1] 17.15279

Using ?function_name to see detail in th function.

1.2.5 factor

F1 <- factor(c(1,0,1,1,0,1))
F1
## [1] 1 0 1 1 0 1
## Levels: 0 1

1.2.6 list

List1 <- list(integer1 = int,
              character1 = chr,
              vector1 = V1,
              function1 = f,
              factor1 = F1)

List1
## $integer1
## [1] 10
## 
## $character1
## [1] "I am a character"
## 
## $vector1
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $function1
## function(x){ x^2+2*x+1 }
## 
## $factor1
## [1] 1 0 1 1 0 1
## Levels: 0 1

1.2.7 matrix

M1 <- matrix(c(1,2,1,
               0,2,7),
             ncol = 3,nrow = 2,byrow = T)
M1
##      [,1] [,2] [,3]
## [1,]    1    2    1
## [2,]    0    2    7
M2 <- matrix(c(1,0,0,
               0,1,0,
               0,0,1), ncol = 3, nrow = 3, byrow = T)
M2
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
M1%*%M2
##      [,1] [,2] [,3]
## [1,]    1    2    1
## [2,]    0    2    7

1.2.8 data frame

df <- data.frame(Name = c("Sam", "Yuji", "Lan", "Elaine"),
                 Age = c(28,24,NA,NA),
                 Height = c(177, 187, 159, 164),
                 Gender = factor(c("M", "M", "F", "F")))
df

1.3 Loop

When you repeat the same things many times.

1.3.1 for

buliding <- c()

for(Floor in 1:101){
  buliding <- rbind(Floor, buliding)
}

buliding
##       [,1]
## Floor  101
## Floor  100
## Floor   99
## Floor   98
## Floor   97
## Floor   96
## Floor   95
## Floor   94
## Floor   93
## Floor   92
## Floor   91
## Floor   90
## Floor   89
## Floor   88
## Floor   87
## Floor   86
## Floor   85
## Floor   84
## Floor   83
## Floor   82
## Floor   81
## Floor   80
## Floor   79
## Floor   78
## Floor   77
## Floor   76
## Floor   75
## Floor   74
## Floor   73
## Floor   72
## Floor   71
## Floor   70
## Floor   69
## Floor   68
## Floor   67
## Floor   66
## Floor   65
## Floor   64
## Floor   63
## Floor   62
## Floor   61
## Floor   60
## Floor   59
## Floor   58
## Floor   57
## Floor   56
## Floor   55
## Floor   54
## Floor   53
## Floor   52
## Floor   51
## Floor   50
## Floor   49
## Floor   48
## Floor   47
## Floor   46
## Floor   45
## Floor   44
## Floor   43
## Floor   42
## Floor   41
## Floor   40
## Floor   39
## Floor   38
## Floor   37
## Floor   36
## Floor   35
## Floor   34
## Floor   33
## Floor   32
## Floor   31
## Floor   30
## Floor   29
## Floor   28
## Floor   27
## Floor   26
## Floor   25
## Floor   24
## Floor   23
## Floor   22
## Floor   21
## Floor   20
## Floor   19
## Floor   18
## Floor   17
## Floor   16
## Floor   15
## Floor   14
## Floor   13
## Floor   12
## Floor   11
## Floor   10
## Floor    9
## Floor    8
## Floor    7
## Floor    6
## Floor    5
## Floor    4
## Floor    3
## Floor    2
## Floor    1

2 Probabilities and Statistics

Textbook


2.1 Probability Space

\[ \Large (\Omega, \mathscr{F}, P) \]

  • \(\Omega\): Sample space, all of the occurring events.

  • \(\mathscr{F}\): \(\sigma\)-algebra, hard to explained, simply define \(\Omega\in\mathscr{F}\), fixed the math.

  • \(P\): probability measure, the value of the occurring event, and between 0 and 1.


2.1.1 simply example

Probability space of tossing a fair coin, then:

  • \(\Omega = \{H,T\}\)

  • \(\mathscr{F} = \{H,T\}\)

  • \(P(\{H\})=P(\{T\})=\frac{1}{2}\)

2.1.2 Probability Measure

  • Three Axiom

    • \(0\leq P(A)\leq 1,\;\forall A\)

    • \(P(\Omega) = 1\)

    • \(P(\bigcup_{i=1}^n A_i)=\sum_{i=1}^n P(A_i), \text{where }A_i\cap A_j=\varnothing,\;\forall i,j\)


2.2 Discrete Random Variable

  • Use X,Y,Z,… define random variables, and x,y,z,.. define values.

  • Random Variables \(X\) includes all possible outcomes.

  • Probability is denoted by \(P(X=x)=f(x)\).

  • Mean: \(E(X) = \sum_{x=1}^nx_if(x) = \mu\)

  • variance:\(Var(X) = \sum_{x=1}^n(x-\mu)^2f(x)= E(X-E(X))^2=E(X^2)-(E(X))^2\)


2.2.1 Binomial

\[ \Large X\sim B(n,p) \]

  • \(n\) and \(p\) are parameters of distribution

  • \(n\): n times for trail

  • \(p\): probability of event for each trail

  • Probability Mass Function (pmf):

\[ f_X(x) = \binom{n}{x}\;p^xq^{n-x},\;\forall x = 1, \cdots, n \]


2.3 Continuous Random Variable

  • Probability is denoted by \(P(X\leq x)=\int_{-\infty}^xf(x)dx\), then \(P(X=x)=0\)

  • Mean: \(E(X) = \int_\mathbb {r class.output="code-output"}xf_X(x)dx = \mu\)

  • variance:\(Var(X) = \int_\mathbb {r class.output="code-output"}(x-\mu)^2f_X(x)dx = E(X-E(X))^2=E(X^2)-(E(X))^2\)


2.3.1 Exponential

  • \(X\sim\exp(\lambda)\)

  • Probability Density Function (pdf.) \(f_X(x) = \lambda e^{-\lambda x}\)

f <- function(x, lambda = 1){lambda*exp(-lambda*x)}
curve(f,from = 0, to = 5, col = 1)
curve(f(x,2),from = 0, to = 5, col = 2, add = TRUE)
curve(f(x,.5),from = 0, to = 5, col = 3, add = TRUE)


2.3.2 Normal

\[ \Large X\sim N(\mu, \sigma^2) \]

\[ f_X(x) = \cfrac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},\;\forall\;x\in\mathbb``` {r class.output="code-output"} \]


2.3.3 t-distribution

  • \(X\sim t_n\)

  • \(t_n\overset{d}{=}\color {r class.output="code-output"ed}{N(0,1)}\;\) as \(\;n\rightarrow\infty\)

curve(dt(x,df=1),from=-4,to=4,col="#000000",ylim=c(0,0.4))
curve(dt(x,df=2),from=-4,to=4,col="#240000",add=T)
curve(dt(x,df=3),from=-4,to=4,col="#460000",add=T)
curve(dt(x,df=4),from=-4,to=4,col="#690000",add=T)
curve(dt(x,df=5),from=-4,to=4,col="#800000",add=T)
curve(dt(x,df=8),from=-4,to=4,col="#990000",add=T)
curve(dt(x,df=10),from=-4,to=4,col="#bb0000",add=T)
curve(dt(x,df=50),from=-4,to=4,col="#cc0000",add=T)
curve(dt(x,df=91),from=-4,to=4,col="#dd0000",add=T)
curve(dt(x,df=100),from=-4,to=4,col="#ee0000",add=T)
curve(dnorm(x,0,1),from=-4,to=4,col="#ff0000",add=T)


3 Data with R


3.1 Random Sample

\[ \mathcal{X}=(X_1, X_2,\cdots, X_{100}),\;\forall\;X_i\overset{iid}{\sim} N(160,10) \]

set.seed(19970608)
#random
x <- rnorm(n = 100, mean = 160, sd = sqrt(10))
sort(x, decreasing = FALSE)
##   [1] 150.2045 150.3638 151.9515 152.4867 153.2311 153.9550 155.0583 155.1287
##   [9] 155.4500 155.8637 156.0542 156.3821 156.4040 156.6807 156.8222 156.9162
##  [17] 156.9413 157.3930 157.4179 157.4487 157.4901 157.5814 157.6270 157.8093
##  [25] 158.1321 158.1779 158.1841 158.2011 158.3338 158.4286 158.5381 158.5747
##  [33] 158.6128 158.6130 158.6211 158.7004 158.8500 158.9164 158.9294 158.9382
##  [41] 158.9697 159.0128 159.0373 159.3363 159.4597 159.5536 159.6015 159.7461
##  [49] 159.8022 159.8833 159.8919 159.9494 159.9613 159.9639 160.1277 160.2413
##  [57] 160.2432 160.2516 160.2878 160.4307 160.5644 160.5740 160.7001 160.8122
##  [65] 160.8769 160.8916 160.9366 160.9986 161.0055 161.0284 161.0291 161.2953
##  [73] 161.3876 161.4904 161.5726 161.6253 161.6258 161.6970 161.7136 161.8119
##  [81] 162.3868 162.4849 162.5548 162.6081 163.2007 163.2842 163.4371 163.7381
##  [89] 163.8430 163.8762 164.2689 164.4121 164.4778 165.3108 165.5170 165.9530
##  [97] 166.4605 167.1191 167.4459 168.5680

3.2 Histogram

hist(x, nclass = 20)


3.3 Data table

library(kableExtra)
df <- read.csv(file = "csv/df.csv", fileEncoding = "Big5")
kable(df)
Name Height Weight Gender Age
Sam Lu 177 82 M 28
Lan Huang 159 45 F 21
Yuji Nishida 187 89 M 24
Elaine Liao 164 53 F 27

3.4 New Data (Male)

library(randomNames)
## Warning: package 'randomNames' was built under R version 4.4.1
set.seed(19970608)
New_df.M <- data.frame(Name = randomNames(n = 68, gender = 0, which.names = "first"),
                       Height = rnorm(68, 175, 10),
                       Weight = rnorm(68, 70, 10),
                       Gender = rep("M", 68),
                       Age = sample(20:40, size = 68, replace = TRUE))
New_df.M

3.5 New Data (Female)

library(randomNames)
set.seed(19970607)
New_df.F <- data.frame(Name = randomNames(n = 28, gender = 1, which.names = "first"),
                       Height = rnorm(28, 160, 10),
                       Weight = rnorm(28, 50, 10),
                       Gender = rep("F", 28),
                       Age = sample(20:40, size = 28, replace = TRUE))
New_df.F

3.6 New Data (n = 100)

New_df <- rbind(df, New_df.M, New_df.F) #cbind()
New_df

3.7 Data editing

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
New_df <- New_df%>%
  mutate(BMI = Weight/(Height/100)^2)%>%
  transform(Gender = as.factor(Gender),
            BMI = round(BMI, digits = 1))
New_df

3.8 Data Information

summary(New_df)
##      Name               Height          Weight      Gender      Age       
##  Length:100         Min.   :141.8   Min.   :33.26   F:30   Min.   :20.00  
##  Class :character   1st Qu.:164.2   1st Qu.:53.76   M:70   1st Qu.:25.00  
##  Mode  :character   Median :171.9   Median :62.46          Median :31.00  
##                     Mean   :171.1   Mean   :62.89          Mean   :30.21  
##                     3rd Qu.:178.5   3rd Qu.:72.78          3rd Qu.:35.00  
##                     Max.   :198.8   Max.   :96.98          Max.   :40.00  
##       BMI       
##  Min.   :10.50  
##  1st Qu.:18.38  
##  Median :21.40  
##  Mean   :21.48  
##  3rd Qu.:24.93  
##  Max.   :33.70

3.9 Data searching

New_df%>%filter(BMI<13)
New_df%>%filter(Height>190)
New_df%>%filter(Age==28)

4 Statistics with R


4.1 One-sample Student’s t Test

\[ H_0: \mu=170 \]

\[ H_1: \mu\neq 170 \]

t.test(New_df$Height, mu=170)
## 
##  One Sample t-test
## 
## data:  New_df$Height
## t = 0.9556, df = 99, p-value = 0.3416
## alternative hypothesis: true mean is not equal to 170
## 95 percent confidence interval:
##  168.8563 173.2688
## sample estimates:
## mean of x 
##  171.0625

4.2 One-sample Student’s t Test

\[ H_0: \mu>170 \]

\[ H_1: \mu\leq 170 \]

t.test(New_df$Height, mu=170, alternative = "greater") #alternative = "less", when H_0:\mu<170
## 
##  One Sample t-test
## 
## data:  New_df$Height
## t = 0.9556, df = 99, p-value = 0.1708
## alternative hypothesis: true mean is greater than 170
## 95 percent confidence interval:
##  169.2163      Inf
## sample estimates:
## mean of x 
##  171.0625

4.3 Two-sample Student’s t Test

\[ H_0: \mu_1<\mu_2 \]

\[ H_1: \mu_1\geq\mu_2 \]

BMI_M <- New_df%>%filter(Gender=="M")
BMI_M <- sample(BMI_M$BMI, 30, replace = F)
BMI_F <- New_df%>%filter(Gender=="F")
BMI_F <- BMI_F$BMI 

#test
t.test(BMI_M, BMI_F, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  BMI_M and BMI_F
## t = 3.7572, df = 57.764, p-value = 0.9998
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 5.818209
## sample estimates:
## mean of x mean of y 
##  22.17333  18.14667

5 Model with R


5.1 Linear Model

\[ Y = \beta_0+\beta_1X+\varepsilon\;,\;\text{where }\;\varepsilon\sim N(0,1) \]

glm1 <- glm(Height~Weight, data = New_df)
summary(glm1)
## 
## Call:
## glm(formula = Height ~ Weight, data = New_df)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.72221    4.57094  32.974  < 2e-16 ***
## Weight        0.32341    0.07086   4.564 1.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 103.0028)
## 
##     Null deviance: 12240  on 99  degrees of freedom
## Residual deviance: 10094  on 98  degrees of freedom
## AIC: 751.24
## 
## Number of Fisher Scoring iterations: 2

5.2 Regression Plot

library(plotly)


est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
                      Weight = New_df$Weight)

plot_ly(New_df, x = ~Weight,
        y = ~Height,
        text = New_df$Name,
        type = 'scatter',
        alpha = 0.65,
        mode = 'markers',
        name = 'Weight-Height')%>%
  add_trace(data = New_est,
            x = ~Weight,
            y = ~Height,
            name = 'Regression Fit',
            mode = 'lines')

5.3 Regression Plot (Gender)

library(plotly)

est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
                      Weight = New_df$Weight)

plot_ly(New_df,
        x = ~Weight,
        y = ~Height,
        text = New_df$Name,
        color = ~New_df$Gender,
        type = 'scatter',
        alpha = 0.65, mode = 'markers',
        name = 'Weight-Height')%>%
  add_trace(data = New_est,
            x = ~Weight,
            y = ~Height,
            name = 'Regression Fit',
            mode = 'lines')

5.4 Regression Plot (Age)

library(plotly)

est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
                      Weight = New_df$Weight)

plot_ly(New_df, x = ~Weight,
        y = ~Height,
        text = New_df$Name,
        color = ~New_df$Age,
        type = 'scatter',
        alpha = 0.65,
        mode = 'markers',
        name = 'Weight-Height')%>%
  add_trace(data = New_est,
            x = ~Weight,
            y = ~Height,
            name = 'Regression Fit',
            mode = 'lines')

5.5 Statistical meaning

  • Model: \(Y=X\beta+\varepsilon\), for \(\varepsilon\sim N(0,1)\)

  • Estimate \(\beta_0\), \(\beta_1\)

\[ Y = \begin{pmatrix} Y_1\\ \vdots\\ Y_n \end{pmatrix} ,\; X = \begin{pmatrix} 1 & X_1\\ \vdots & \vdots\\ 1 & X_n \end{pmatrix} ,\;\text{and }\; \beta = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \]

  • \(\hat{\beta} = \begin{pmatrix} \hat{\beta_0}\\ \hat{\beta_1} \end{pmatrix}\) is LSE of \(\beta\)

5.5.1 LSE

  • LSE (Least Squared Estimation) to estimate \(\beta\)

  • Estimation of \(Y\): \(\hat{Y}=X\hat{\beta}\)

  • Minimized:

\[ (Y-X\hat{\beta})^T(Y-X\hat{\beta}) \]

  • Solve: (maximum and minimum for the function while the derivative equal to 0.)

\[ \cfrac{\partial}{\partial\hat{\beta}} (Y-X\hat{\beta})^T(Y-X\hat{\beta})=0 \]

  • We have:

\[ \hat{\beta} = (X^TX)^{-1}X^TY \]


6 QA


6.1 ERROR


6.1.1 error: can not find the object ‘x’

Here’s the problem for undefined variable ‘x’.


6.6.2 error: the function ‘f’ not found.

Check the function exists or the package with this function should be installed. Please use ?f to find the function detail. If you use the function f in package called Basic, then install the package and run the code library("Basic"). You then use the function.


6.6.3 wrong data type

Sometimes, the package require numeric only, you could use as.numeric() to change its type and utilize class() to check its type.

c1 <- c(1,2,5,7,"1","6")

# turn into numeric
c1 <- as.numeric(c1)

#check
class(c1)
## [1] "numeric"